Report

Implementation: Inside the while loop, the next action A′ is selected and passed into the update-policy function. That function looks up Q(S′, A′) in the table and applies the SARSA update Q(S, A) ← Q(S, A) + α(R + γQ(S′, A′) − Q(S, A)).

Difference: In the first several episodes, Q-learning achieves a higher maximum lifetime than SARSA. This is reasonable because Q-learning, as an off-policy algorithm, can estimate the optimal policy faster than SARSA, which is on-policy. The gap is most visible in the first few thousand iterations, since SARSA must try out actions at different states while Q-learning directly follows its current estimate of the optimal policy. After a large number of episodes, SARSA should catch up with Q-learning in maximum lifetime; in my results, however, SARSA only partially closed the gap, perhaps because the number of episodes was not large enough.
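The update rule described above can be sketched as follows. This is a minimal illustration, not the actual assignment code: the dict-based table, the function name `sarsa_update`, and the hyperparameter defaults are all assumptions.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """Apply one tabular SARSA update:
        Q(S, A) <- Q(S, A) + alpha * (R + gamma * Q(S', A') - Q(S, A))

    Q is a dict mapping (state, action) pairs to values; unseen pairs
    default to 0.0. (Illustrative representation, not the report's code.)
    """
    q_sa = Q.get((s, a), 0.0)
    # On-policy: bootstrap from the action A' actually selected in the
    # while loop, not from max over actions as Q-learning would.
    q_next = Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * q_next - q_sa)
    return Q[(s, a)]


# Example: one update from an empty table with alpha=0.5, gamma=0.9.
Q = {}
sarsa_update(Q, 0, 1, 1.0, 2, 3, alpha=0.5, gamma=0.9)
```

Replacing `Q.get((s_next, a_next), 0.0)` with a max over all actions in `s_next` would turn this into the Q-learning update, which is exactly the off-policy difference discussed above.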